Telco Customer Churn Project
1 Telco Customer Churn
1.1 Description
The Telco customer churn data contains information about a telephone company that provided home phone and Internet services to 7043 customers in California at the end of 2017 Quarter 3. It indicates which customers have left, stayed, or signed up for their service.
Studying such data can help companies identify the characteristics of lost customers, identify potential, soon-to-be-lost customers and develop appropriate strategies to retain them.
The dataset is WA_Fn-UseC_-Telco-Customer-Churn.csv.
1.1.1 variables
gender: Female or MaleSeniorCitizen: customer is a senior citizen or not (Yes, No)Partner: customer has a partner or not (Yes, No)Dependents: customer has dependents or not (Yes, No)tenure: number of months the customer has stayed with the companyPhoneService: customer has a phone service or not (Yes, No)MultipleLines: customer has multiple lines or not (Yes, No, No phone service)InternetService: customer’s internet service provider (DSL, Fiber optic, No)OnlineSecurity: customer has online security or not (Yes, No, No internet service)OnlineBackup: customer has online backup or not (Yes, No, No internet service)DeviceProtection: customer has device protection or not (Yes, No, No internet service)TechSupport: customer has tech support or not (Yes, No, No internet service)StreamingTV: customer has streaming TV or not (Yes, No, No internet service)StreamingMovies: customer has streaming movies or not (Yes, No, No internet service)Contract: contract term of the customer (Month-to-month, One year, Two year)PaperlessBilling: customer has paperless billing or not (Yes, No)PaymentMethod: Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)MonthlyCharges: amount charged monthlyTotalCharges: total amount chargedChurn: customer churned or not (Yes or No)
2 Exploratory Data Analysis (EDA)
2.1 What factors influence customer churn?
2.1.1 bar plots for categorical variables
# gender seniorCitizen partner depedents
gender <- ggplot(customer, aes(x=gender, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Gender", x="", y="Percentage") +
theme(legend.position="top")
senior <- ggplot(customer, aes(x=SeniorCitizen, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "Not senior", "Yes" = "Senior")) +
labs(title="SeniorCitizen", x="", y="Percentage") +
theme(legend.position="top")
partner <- ggplot(customer, aes(x=Partner, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "Have no partner", "Yes" = "Have a partner")) +
labs(title="Partner", x="", y="Percentage") +
theme(legend.position="top")
dependent <- ggplot(customer, aes(x=Dependents, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "Have no dependents", "Yes" = "Have dependents")) +
labs(title="Dependent", x="", y="Percentage") +
theme(legend.position="top")
basicInfo <- ggarrange(gender, senior, partner, dependent)
annotate_figure(basicInfo, top = text_grob("Customer Basic Infomation",
color = "Black", face = "bold", size = 14)) In this group, gender and partner do not influence churn percentage. Whether the customers are senior citizens and whether they are dependent or not are two most influential factors.
# PhoneService MultipleLines
phone_service <- ggplot(customer, aes(x=PhoneService, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "Have no phone service", "Yes" = "have phone services")) +
labs(title="Phone Service", x="", y="Percentage") +
theme(legend.position="top")
multiple_lines <- ggplot(customer, aes(x=MultipleLines, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "Have no multiple lines", "Yes" = "Have multiple lines")) +
labs(title="Multiple lines", x="", y="Percentage") +
theme(legend.position="top")
PhoneInfo <- ggarrange(phone_service, multiple_lines)
annotate_figure(PhoneInfo, top = text_grob("Phone Infomation",
color = "Black", face = "bold", size = 14)) We can see that these two factors does not influence the churn percentage a lot, but the columns height are different. Therefore, they are not influential factors.
#InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
internet_service <- ggplot(customer, aes(x=InternetService, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Internet services", x="Internet company", y="Percentage") +
theme(legend.position="top")
online_security <- ggplot(customer, aes(x=OnlineSecurity, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Online security", x="", y="Percentage") +
theme(legend.position="top")
online_backup <- ggplot(customer, aes(x=OnlineBackup, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Online backup", x="", y="Percentage") +
theme(legend.position="top")
device_protection <- ggplot(customer, aes(x=DeviceProtection, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Device protection", x="", y="Percentage") +
theme(legend.position="top")
tech_support <- ggplot(customer, aes(x=TechSupport, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Tech support", x="", y="Percentage") +
theme(legend.position="top")
InternetInfo <- ggarrange(internet_service, online_security, online_backup, device_protection, tech_support)
annotate_figure(InternetInfo, top = text_grob("Internet Information",
color = "Black", face = "bold", size = 14)) In the internet part, we can see these categorical variables all have influences on churn. However, online security and tech support are two remarkable factors. People with no online security and techn support will churn most.
#StreamingTV StreamingMovies
streaming_TV <- ggplot(customer, aes(x=StreamingTV, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Streaming tv", x="", y="Percentage") +
theme(legend.position="top")
streaming_movies <- ggplot(customer, aes(x=StreamingMovies, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Streaming movies", x="", y="Percentage") +
theme(legend.position="top")
StreamingInfo <- ggarrange(streaming_TV, streaming_movies)
annotate_figure(StreamingInfo, top = text_grob("Streaming Information",
color = "Black", face = "bold", size = 14)) For streaming information, we neglect customers who does not have internet service. Without middle column, we can see that they do not influence churn a lot.
#Contract PaperlessBilling PaymentMethod
contract <- ggplot(customer, aes(x=Contract, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="Contract", x="Method", y="Percentage") +
theme(legend.position="top")
paperless_billing <- ggplot(customer, aes(x=PaperlessBilling, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("No" = "No paperless billing", "Yes" = "Have paperless billing")) +
labs(title="Paperless billing", x="", y="Percentage") +
theme(legend.position="top")
payment_method <- ggplot(customer, aes(x=PaymentMethod, fill=Churn)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("pink3", "steelblue")) +
scale_x_discrete(labels=c("Bank transfer (automatic)" = "bank transfer", "Electronic check" = "E-check", "Credit card (automatic)" = "credit card")) +
labs(title="Payment method", x="Method", y="Percentage") +
theme(legend.position="top")
PaymentInfo <- ggarrange(contract, paperless_billing, payment_method)
annotate_figure(PaymentInfo, top = text_grob("Payment Information",
color = "Black", face = "bold", size = 14)) For the final part, they are all important factors. In payment method, we can see that the churn percentage of customers who choose to e-check is high. Maybe the online payment system is a trouble, and it influences the experience. Next, our team will introduce numeric variables.
2.1.2 kde plots for numeric variables
tenure_kdeplot <- ggplot(data = customer, aes(x = tenure, color = Churn)) +
geom_density(aes(fill = Churn), alpha = 0.8) +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="KDEplot for tenure") +
labs(x="tenure", y="density") +
theme(legend.position="top")
tenure_kdeplotMonthlyCharges_kdeplot <- ggplot(data = customer, aes(x = MonthlyCharges, color = Churn)) +
geom_density(aes(fill = Churn), alpha = 0.8) +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="KDEplot for MonthlyCharges") +
labs(x="MonthlyCharges", y="density") +
theme(legend.position="top")
MonthlyCharges_kdeplotTotalCharges_kdeplot <- ggplot(data = customer, aes(x = TotalCharges, color = Churn)) +
geom_density(aes(fill = Churn), alpha = 0.8) +
scale_fill_manual(values=c("pink3", "steelblue")) +
labs(title="KDEplot for TotalCharges") +
labs(x="TotalCharges", y="density") +
theme(legend.position="top")
TotalCharges_kdeplotlm1 <- glm(Churn~tenure, family = binomial(link = "logit"), data = customer)
lm2 <- glm(Churn~tenure + MonthlyCharges, family = binomial(link = "logit"), data = customer)
lm3 <- glm(Churn~tenure + MonthlyCharges + TotalCharges, family = binomial(link = "logit"), data = customer)
anovat <- anova(lm1,lm2,lm3, test="LRT")
anovat## Analysis of Deviance Table
##
## Model 1: Churn ~ tenure
## Model 2: Churn ~ tenure + MonthlyCharges
## Model 3: Churn ~ tenure + MonthlyCharges + TotalCharges
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7030 7176
## 2 7029 6382 1 794 <2e-16 ***
## 3 7028 6376 1 6 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
xkabledply(anovat)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7030 | 7176 | NA | NA | NA |
| 7029 | 6382 | 1 | 794.37 | 0.0000 |
| 7028 | 6376 | 1 | 5.67 | 0.0173 |
Logistic regression is the appropriate regression analysis to predict a binary outcome (the dependent variable) based on a set of independent variables.
Then we compare AUC of model 2 and model 3
prob <- predict(lm2,customer, type = c("response"))
customer$prob <- prob
library(pROC)
g <- roc(Churn ~ prob, data = customer)
plot(g)auc(customer$Churn, prob)## Area under the curve: 0.808
prob1 <- predict(lm3,customer, type = c("response"))
customer$prob <- prob1
library(pROC)
g1 <- roc(Churn ~ prob1, data = customer)
plot(g1)auc(customer$Churn, prob1)## Area under the curve: 0.809
AUC (Area Under The Curve) - ROC (Receiver Operating Characteristics) curve is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the Higher the AUC, the better the model is at distinguishing between customer with churn and no churn.
#summary(lm2)
xkabledply(lm2)| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.7909 | 0.0866 | -20.7 | 0 |
| tenure | -0.0550 | 0.0017 | -32.5 | 0 |
| MonthlyCharges | 0.0329 | 0.0013 | 25.3 | 0 |
As we can see from this summary, tenure has negative coefficient with churn and monthlycharges has positive coefficient with churn. It means when we have a customer with lower tenure and high monthlycharges having more probability to churn.
2.1.3 Simple correlations
Since most of variables are factors, it makes more sense to check their Spearman correlations.
customerNum = customer
# convert categorical variable as numeric for spearman method
for(i in 2:21){
# tenure, MonthlyCharges, TotalCharges
if (!(i %in% c(6, 19, 20))){
customerNum[,i] = as.numeric(customerNum[,i])
}
}
#str(customerNum)
# corrplot with spearman method for categorical variables
customercor <- cor(subset(customerNum, select=-c(customerID, prob)), method="spearman")
#customercor
loadPkg("corrplot")
#corrplot.mixed(customercor, tl.pos = "lt", number.cex = .5, tl.cex=0.8)
corrplot(customercor, type="lower", addCoef.col="black", number.cex=0.5, tl.cex=0.7,title="Telco Customer Churn Correlation", mar=c(0,0,1,0))unloadPkg("corrplot")Larger circle means higher correlation. We can see that churn has negative correlation with contract and tenure, which means that customer who stays longer with the company or has a longer contract terms is less likely to churn. Customer who signed up for online security service and has technical support plan is also less likely to churn. So it makes sense that contract and tech support have positive correlation, which means most customers who signed up for a technical support plan also have longer contract term.